an introduction to mechanistic interpretability
In the last lab you trained a tiny model. It works — but it's a black box of numbers. This lab opens it up. We'll recover the actual algorithm hiding in the weights, in plain words. That's mechanistic interpretability.
Nobody programmed our model's rules by hand — training discovered them and stored them in a grid of weights. Mechanistic interpretability is the craft of reading those weights back out as something a human can understand: features, rules, circuits. Not “the model is 91% accurate,” but “here is exactly what it computes, and why.”
Real models are dauntingly large, so the field starts where we can win: the smallest model that still does something. Ours is the simplest possible — a single matrix — which means we can decode it completely. The perfect first patient.
Pick a letter. The model's entire opinion about “what comes next” is a single row of the weight matrix, turned into probabilities. We don't have to run the model — we can just read it off the page. And look: it matches what the data actually does. That's a learned feature, caught in the open.
Looking isn't proof. To show a weight causes a behavior, you delete it and see what breaks — an ablation. Knock out the column that lets words end and the model never shuts up. The dark stripe in the matrix is the damage; the words below are the symptom.
Rank every row by its most confident prediction and you get the model's rulebook, in English. This is the whole point of interpretability: turning a wall of numbers into statements you can check.
If it always took its single most likely letter, never gambling, it would write exactly one word, forever:
That single deterministic trace is the model's backbone — temperature and sampling just let it wander off this path.
Everything you just did — reading weights, ablating, extracting rules — is exactly how researchers study GPT-scale models. Only the objects get richer. Here's the vocabulary, mapped to math you already own.
A concept — “this is code,” “this is in French” — is a direction in the model's activation vectors. To test for it, you take a dot product (Chapter 4). Interpretability is largely the hunt for meaningful directions.
Every layer reads from and writes to one shared vector per token — a running notebook. Each matrix (Chapter 5) edits it a little. The “logit lens” you used is just reading that notebook partway through.
The first circuit ever fully reverse-engineered: an attention head that finds the previous place the current token appeared and copies whatever came next. It's how a model that saw “X Y” earlier, on meeting “X” again, predicts “Y” — copy what followed last time. Pure dot-product matching (Ch.4) plus softmax (Ch.6) — the attention you met in Chapter 7.
Models cram far more features than they have dimensions, so a single neuron lights up for many unrelated things. Our one clean matrix was a gift; real weights are tangled, which is why tools like sparse autoencoders exist to pull the features apart.
You just opened a model and explained what it does. That's the whole job.
The same three moves — read the weights, intervene to prove cause, extract the rule — scale all the way up to making frontier models honest and safe. The toy here is small; the method is the real thing.